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ABSTRACT 



A database-management system (10) generates bound- 
ed-disorder indexes on its database keys. In such an 
index, the leaf nodes (51, 62) are large and are divided 
into a number of buckets (52, 54, 56, 58), only one of 
which ordinarily is accessed in any given single-record 
database operation. The key values in a leaf node are 
distributed among the leaf node's buckets in accordance 
with a hashing function. The lockable ranges locked for 
scanning functions are defined in accordance with key- 
valued locking, in which each lockable range is 
bounded by successive key values that exist in the data- 
base. But the multiple-bucket accesses that would other- 
wise be required, because of the hash-function distribu- 
tion of key values among a node's several buckets, are 
avoided because the lockable ranges are defined by the 
sequence of key values in the bucket rather than in the 
node. In addition to the existing key values, moreover, 
the buckets' key-value limits are also employed to 
bound lockable ranges, even if no database records 
contain those key-value limits. This prevents end-of- 
bucket insertions and deletions from needing further 
I/O operations in order to identify the lockable ranges 
that those insertions and deletions modify. 

9 Claims, 4 Drawing Sheets 
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group of DBMS modules to which we will refer as a 

KEY-RANGE LOCKING WITH INDEX TREES "query compiler" 16. 

The definer may write the transaction definition in a 
BACKGROUND OF THE INVENTION general-purpose language such as PL/1, but he will 

_ . „. 5 typically embed in the PL/1 program instructions writ- 

The present invention is directed to resource-man- ^ m ^ m h . level database language , such ^ SQL, to 
agement systems and m particular to the locking strate- whkh ^ DBMS responds This can be done in PL/1, 
gies that they employ. for mstance( by preceding, say, SQL commands with 

A resource-management system is typically imple- «exeC SQL," which is a signal that the overall PL/1 
mented in a computer, including its various types of 1Q source code ^ have to be submitted to a precompiler 
storage apparatus, programmed with appropriate soft- before PL/1 compilation so as to strip out the SQL 
ware. One type of resource-management system, statements and replace them with appropriate subrou- 
namely, a database-management system ("DBMS") can ^ ca j] s The stripped-out SQL statements would be 
be thought of as operating in two phases, the first of presented to the query compiler, which would compile 
which may be called a design phase and the second of 15 them into instructions that call upon the services of 
which may be called an execution phase. various operative modules within the DBMS's run-time 

FIG. 1 represents the design phase. In this phase, the supervisor. 
DBMS 10 provides a database designer with ways of A transaction-defining statement in the high-level 
defining the "structure" of the data, i.e., the manner in database language may be in the form: 
which its storage is organized, and of defining transac- 20 START TRANSACTION 
tions in which the data thus stored will be accessed by SELECT INC 
end users. FROM INCOME 

The first function, performed in what we can call a WHERE SSN= 123456789" 
"metadata manager" 12, typically responds to structure- COMMIT WORK; 

defining (data-definition) commands entered by the 25 When the routine specified by such a statement runs, it 
designer in a high-level database language such as SQL. searches the database for a record in which the SSN 
A relational DBMS, for instance, may accept a data- (Social Security Number) field is 123 45 6789 and fet- 
definition instruction of the following form: ches the value in the income field of that record. 

CREATE TABLE INCOME To compile this command, the query compiler 16 

NAME CHAR (20), 30 consults the database catalog, which contains the im- 

INC DECIMAL (1Q2\ plementer's definitions of the data organization, such as 

SSN CHAR (9)* that "INCOME" is a relation and INC and SSN are two 

Such a statement may establish that the database will of its attributes It also- detennines the best manner in 
include a relation, or table, called "income," in which which to search for the mmcated infomation, and it 
7T 7 j • , »*^u„TL fi^e 35 employs the mdex information m order to do this. If, for 

each tuple, or record inc instance, the definer has required that an index ordered 
including a mME field in tie form of twenty .char- social ^ ecuritv number £e provided, then the query 

acters, an "INC" (income) field of ter 1 decimal ^digits J > transaction routine that accesses 

vrtfa 1a decimal pomt two places ^from the right, and an mdex fa such asituationj ^ 

"SSN" (social security number) field in the form of nine ^ S§N {$ ^ „ key „ by which ^ systfim ^ 

characters. the desired record (or records). 

The database = designer r^y use sunilar statements to ft shomd hasized at ^ point that, although 

define other tables that the database will include. He we d ict ^ compi ier as operating in the design 
may further define one or more indexes, whose mainte- - e rather ^ in ^ pnase> many of what 

nance he knows will be valuable m the use of the data- 45 wi]1 be described below as features of the query corn- 
base. For instance, he may employ a command such as: « m Qn] m most DBMS implementations 

CREATE INDEX ON INCOME (INC); ^ ^ phase ^ m progress. In particular, 

to create an index of the records ordered by the values ^ typical query compiler places in the transaction 
of the respective records' income fields. routine 17 calls to operation subroutines chosen from a 

The result of such commands is to cause entries re- 50 su b r0 utine set resident only during the execution phase 
fleeting them in a database catalog 14 that the DBMS ^ commonly called by most transaction routines, 
creates and maintains for use in its access operations. Since the contents of these subroutines are all part of the 
Note that none of the foregoing activities provides ac- translation from transaction definition to implementa- 
tual database contents; attributes have been named, but ^on instructions, we consider them features of the 
no values of those attributes have necessarily been en- 55 query compiler, and the remaining discussion will not 
tered. distinguish between the functions that the query com- 

We will assume for present purposes that it is the end p ii er performs directly during the design phase and 
users who supply the database contents by various ma- those that its design-phase actions perform only indi- 
nipulating transactions. However, it is the database rectly, by incorporating calls to existing subroutines 
designer who defines the types of transaction routines 60 resident during the execution phase, 
that the end users invoke for this purpose. (Actually, a Among the capabilities of many DBMSs is that of 
separate application programmer may perform this task. maintaining what is known as "serializability." In writ- 
From here on, we will refer collectively to the various ing transaction definitions, the database definer is defin- 
personnel, other than the end users, who operate on the ing what will happen when an end user submits a re- 
database as the "definer.") That is, the definer generates 65 quest to the system. One resulting transaction routine, 
transaction routines, which the DBMS's run-time sys- for instance, might be invoked by a user at an automatic 
tern will perform in response to requests from the end teller machine to transfer money from one account to 
user. To produce these routines, the definer employs a another. Another might be invoked by a bank executive 
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from a management-information-system terminal to 
request the total of all account balances for a given 
branch of the bank. In each case, the end user submits 
his transaction request to a run-time supervisor 18 (FIG. 
2), which calls the specified transaction routine. The 5 
transaction routine obtains records from a storage me- 
dium 19, such as a magnetic disk or cached copies of its 
contents, through the operation of a buffer manager 20. 
The actual central-processor time involved in each of 
these transactions may be very small, but the time re- 10 
quired between the beginning and end of the transaction 
may be considerable, since time is usually required to 
await data from storage facilities, inputs from the 
human user, etc. In the operation of most large data- 
bases, it is therefore important that the central processor 15 
be freed to perform operations of other transactions 
between the individual operations of any single transac- 
tion. But this interleaving can cause problems if steps 
are not taken to prevent them. 

For example, the ATM user's transfer of money from 20 
one account to another may actually be implemented in 
a number of operations, which may be interspersed with 
inputs from the user. The transfer involves the separate 
steps of removing money from one account and adding 
it to another. Similarly, the bank executive's transaction 25 
of obtaining account totals may include a number of 
separate operations, each of which involves reading an 
account balance and adding it to a running total. If these 
operations are interleaved, the account-totaling opera- 
tion may copy the balance from the ATM user's first 30 
account before the transfer and from the second ac- 
count after the transfer, and this would indicate to the 
executive that the bank total is greater than it actually 
is. 

This would not result, of course, if the two transac- 35 
tions occurred serially. One of the functions of most 
DBMSs is therefore to perform transactions in such a 
way that concurrently performed sets of them are seri- 
alizable, i.e., that their results are the same as that of a 
series of nonconcurrent transactions, without requiring 40 
actual serial transaction ordering. To this end, the query 
compiler usually causes the transaction routine to in- 
clude certain operations that invoke the services of a 
DBMS module known as a "lock manager" 21, which is 
active in the execution phase depicted in FIG. 2 (and 45 
actually in the design phase, too, for purposes not rele- 
vant here) and maintains a lock table 22, whose contents 
indicate which "resources" are currently involved in 
transactions in such a manner that certain operations on 
them by other transactions must be postponed until the 50 
previous transactions have been completed. That is, if a 
transaction performed by such a routine includes an 
access to a certain resource, it will also request that the 
lock manager post a lock in the lock table identifying 
the designated resource as one to which access is re- 55 
stricted. (We use the more-general term resources in- 
stead of records for reasons that will become apparent. 
Until those reasons do become apparent, however, little 
harm results from reading "records" for "resources.") 

When an end user invokes a transaction routine dur- 60 
ing this execution phase, that routine will request that 
the lock manager post a lock on the required resource, 
and the lock manager will return to the routine an indi- 
cation of whether such a lock is permitted. If not — be- 
cause another transaction has already locked that re- 65 
source — the transaction that has requested the lock will 
be at least temporarily prevented from proceeding. 
Otherwise, (except in certain "instant lock" cases that 
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will be described below) the lock manager will enter a 
lock in the lock table and thereby restrict other transac- 
tions' access to the locked resource. 

Of particular interest in the present context is what is 
known as "range locking." The serializability of some 
types of transactions is not assured by simply locking 
the several records to which they require access. An 
example of such a transaction is one that includes a scan 
operation, which accesses all records in which a certain 
attribute's values are within a specified range. 

For instance, one transaction may be to find the total 
income within an income range. A concurrent transac- 
tion may be to insert records for a group of new taxpay- 
ers who first have reportable income within a given 
period. It is desirable that the computed total reflect 
either all or none of the new taxpayers, but individual 
record locking does not insure this result. If the inser- 
tion transaction inserts one record in a part of the range 
that the total transaction has already searched and an- 
other in a part that it has not, then neither transaction 
will encounter the other's locks, and the total-comput- 
ing transaction's result will be erroneous. This is be- 
cause of the "phantom" records that were inserted into 
a range after that range's original occupants were 
locked. 

To remedy this problem, database management sys- 
tems must lock entities other than just records. For 
example, a DBMS may lock a whole file when a scan 
operation is performed in it. But such an expedient can 
produce a considerable reduction in the system's 
achievable concurrency, i.e., in the degree to which 
operations of separate transactions can be interleaved. 
Alternatively, therefore, DBMSs sometimes employ 
range locking, which treats not only individual key. 
values but also key- value ranges as resources to be listed 
in the lock table. Therefore, when a transaction routine 
includes an operation directed to all records within a 
certain target key range, it causes the lock manager to 
post locks directed not only to the individual key values 
to which the transaction will obtain access but also to a 
set of Iockable ranges that covers the target range. And, 
if another operation is directed to a specific key value, 
that transaction not only requests a lock on that key 
value but also has the lock manager check for locks on 
any Iockable range into which that key value falls. This 
latter function is performed by additionally identifying 
any such range resource and requesting a lock on it. 

The routine for inserting records of new taxpayers, 
for instance, would identify all ranges into which the 
records that it intends to insert fall, and it would check 
the lock manager to determine whether there are locks 
on these ranges. Since the total-computing transaction 
will have acquired locks on ranges into which the new 
taxpayer records would otherwise have been inserted, 
the record-insertion transaction is postponed until the 
total-computation transaction has been completed, and 
serializability is thereby maintained. 

We digress at this point to note that locking a key 
value or range thereof is not the same as locking the 
record or records that the key value or range desig- 
nates. In requesting a lock at the beginning of a search 
by key value, the operation passes to the lock manager 
a "resource ID," which is typically a hashed or other- 
wise encoded version of the key value. It is this ID 
against which the lock manager posts a lock and for 
which it searches to determine whether the lock can be 
granted. Now, the record identified by the name-attrib- 
ute value John Doe may be the same as that identified 
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by the social-security-number-attribute value 123 45 Now, suppose that a user calls for performance of a 

6789. But the key-value locking performed by an opera- transaction routine for totalling all incomes within the 

tion that uses John Doe as its key value and uses (typi- S40,000-$60,000 income range and that the database 

cally a hashed or otherwise encoded version of) that designer has specified two predetermined ranges of 

value as the resource identifier when it acquires a lock 5 $30,000 to $50,000 and $50,000 to $70,000. The DBMS's 

on that name does not by itself restrict access to that run-time supervisor responds to this request by running 

record by an operation that requests a lock on the social the requested transaction routine and assigning this 

security number. Typically, of course, the operation instance of that routine's operation the transaction name 

will also perform further locking, such as record-ID Ti. Since transaction Ti will need to read all records 

locking, which will cause access to that record from 10 within the two ranges but not to update any of them, it 

other paths to be restricted. Indeed, one of the keys may requests S locks on the two ranges, as the first row of 

in essence be the record ID. But we are not concerned the FIG. 4 table indicates. The S lock is a covering lock, 

with that here. Here we are concerned only with lock- since it implicitly locks each key value m the range But 

ing on the basis of a given key or range thereof, so we * e transaction does not explicitly place a separate lock 

wm refer to key-value or range locking, which may or " on each key value-i.e., each income level-that it 

may not be accompanied by or equivalent to record fmds ' since « uerv com P der ^ have «? 

. f . r any transaction routine that similarly searches for re- 

^Although it is better than locking the whole file, ^ b V 48 to *f 'j** m ™ a S er to 

locking 4ges of key values can itself reduce concur- ^ ^^T^ 7^ 

6 . .if ^ rf- i « 20 The "read scan" entry in the FIG. 3 mode table re- 

«ncy significantly. T ^J^^ fleets this lock-mode selection. It shows that, upon a 

fcon DBMSs ^employ more ^an one SC an-type read operation, i.e., one which requests all 
tack mode differing from the others in restnctiveness. r€ ^ P witMll /^ wen ^ a lock of ^ S type is 
This restnctiveness variation is useful even if the ^ Qn ^ ^ Qf T J bm nQt Qn ^ 

DBMS does not use range lockmg For example, a lock 1$ in 7 ividual key values . 

acquired by a transaction as a result of an operation that As FIG 5 mdicateSf lock mode S is compatible with 
only reads records does not need to prevent other trans- Qmy twQ lQck mQ& ^ IS ^ s ^ ktter compatibility 
actions from reading those same records, but a lock ^ exemplified by a concurrent transaction ("T 2 "). 
resulting from a write operation does. In recognition of Transaction Tl requires access to the S-locked ranges, 
this fact, a simple, two-mode locking system may em- ^ but Qmy t0 read theni) so it requests S-mode locks of 
ploy lock-table entries that include an indication not ranges xh e i^k manager, which implements the 

only of whether a lock is in place but also of whether compatibility table of FIG. 5, informs T 2 that the re- 
the lock is a share-mode ("S") lock, requested for read ques ted lock mode is compatible with the existing (S- 
operations, or an exclusive-mode ("X") lock, requested mode ) locks on those ranges in its lock table. It there- 
by record-modifying operations. A transaction request- 35 fore posts ^ lock( ^ ^ fjq 4 j ock indicates in 
ing an S lock will be deterred only by X locks on the ^ T2 row 

target resource, not by S locks, while transactions that To observe compatibility of the S mode with the IS 
request X locks will be deterred by locks of both types. mo de, let us consider a transaction ('T3") whose pur- 
Database systems that employ range locking often ^ to f etc h, say, the identity of the taxpayer whose 
obtain additional concurrency by means of "multi- 4q income is $39,547. This involves a "singleton read" 
granularity locking,** or MGL, which conventionally operation, i.e., one that targets an individual key value, 
employs five lock modes, as FIGS. 3, 4, and 5 illustrate. not a range, although the key value may fall within a 
FIG. 3 is a lock-mode table, which indicates the types of range. As FIG. 3 indicates, the query compiler requires 
locks acquired by transactions in accordance with one sucn ^ operation's transaction to request not only an S 
application of the MGL scheme for the different types 45 i 0 ck on the actual key value to which this operation is 
of database-access operations of which they may be directed but also an IS lock on any range in which that 
comprised. In addition to the simple "covering'* lock key values falls. Accordingly, as FIG. 1 shows, T3 ac- 
modes S and X, there are three "intention" lock modes, quires an S lock on income value $39,547 and an IS lock 
designated IS, IX, and SIX, whose purpose is to indi- C n income range $30,000 to $50,000. The lock manager 
cate, with respect to a key range, that a further, cover- 50 grants this lock, even though locks have already been 
ing lock will be acquired on a key value that falls within acquired on that income range, because the lock manag- 
that range. (Actually, the SIX mode, as will be seen e r's compatibility matrix (FIG. 5) indicates compatibil- 
below, is both a covering lock and an intention lock.) ity between the IS and S locks. 

FIG. 4 is an exemplary lock table that will be em- The IS lock is not a covering lock: the operation that 
ployed to explain intention locking. In the example, we 55 caused it is not directed to the range against which the 
will assume that one of the attributes in a relation de- IS lock is posted. Instead, it is an intention lock: it indi- 
fined for a given database is labeled "INC,*' for "in- cates the intention to acquire a lock on a different re- 
come." That is, if the relation is thought of as a table, source, namely, a key value, that the range includes (or, 
each row of the table represents a taxpayer, and one of for some resources, otherwise overlaps). Its purpose is 
the columns represents those taxpayers* incomes. Let us 60 to prevent another transaction from acquiring a cover- 
further assume that the database designer has required ing lock on the range that would be inconsistent with 
(not necessarily in principle but nearly unavoidably in the first transaction's access to the included (or over- 
practice) that an index be maintained of incomes and lapped) resource. 

that he has further specified that the DBMS is to treat The difference between the IS and S locks becomes 
certain predetermined ranges of incomes as lockable 65 apparent when one considers a new transaction T4, 
resources. The reason for this may be that a purpose of whose purpose is to correct the income value for the 
the database is to support studies of characteristics of taxpayer whose income is $41,290. This is a "singleton 
various predetermined income ranges. update," and FIG. 3 indicates that the involved range 
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and key value must be locked in modes IX and X, re- 
spectively. FIG. 4 shows that no previous transaction 
has obtained a lock on the key value of $41,290. But that 
value falls within the $30,000 to $50,000 range, on 
which previous transactions have already acquired 
locks, and the mode, S f of two of those locks is incom- 
patible with the IX-mode lock that transaction T4 re- 
quests on that range. Transaction T4 therefore cannot 
proceed until the S-lock-holding transactions Ti and T2 
have ended, either by aborting or by running to comple- 
tion, i.e., in database parlance, by "committing." 

Now, let us suppose that transactions Ti and T2 com- 
mit. A committing transaction notifies the lock man- 
ager, which responds by removing the transaction's 
entries. FIG. 4 uses check marks to represent such re- 
movals for transactions Ti and T2. The only lock re- 
maining on the $30,000 to $50,000 range after transac- 
tions Ti and T2 commit is therefore the IS-mode lock 
acquired by transaction T3. Remember that transaction 
T3 requested only an IS-mode lock on the range, since it 
did not need to read all of the range but only needed to 
indicate that it would be reading something within the 
range. Thus, as FIG. 3 indicates, the IS mode is compat- 
ible with the IX mode requested by transaction T4 on 
the $30,000 to $50,000 range. After transactions Ti and 
T2 have committed, therefore, transaction T4 can ac- 
quire the locks that it requires before it will proceed. 

It can be appreciated that range locking of this type 
eliminates the phantom problem yet affords consider- 
able concurrency. As the last example indicates, how- 
ever, the use of such deFmer-specifled ranges is some- 
what inflexible. Specifically, the target income range 
mentioned above did not fit the predetermined ranges 
very closely, and locking enough predetermined ranges 
to cover the target range resulted in less concurrency 
than would have been possible if there had been a closer 
fit. Additionally, the approach to MGL just described 
requires considerable locking overhead, since it requires 
that two resources, namely, a key range and an individ- 
ual key value, be separately locked for most operations. 

A system that deals with these problems is the 
ARIES/KVL system described in Mohan, 
"ARIES/KVL: A Key-Value Locking Method for 
Concurrency Control of Multiaction Transactions Op- 
erating on B-Tree Indexes," Proc. Very Large Databases 
Conference, Brisbane, Australia (August 1990). This 
system dynamically re-defines key-value ranges in ac- 
cordance with the current population of key values. 
Specifically, the system maintains a key-value-ordered 
index, and the possible lockable ranges are the ranges 
between each pair of successive key values that cur- 
rently exist in the index. That is, if the existing key 
values are ki, k2, . . . , k/, . . . such that k,*<k/+ 1, then the 
ranges are the disjoint semi-open intervals (k/, k/+ 1], and 
each such range is identified by the upper bounding key 
value. Equivalently, of course, the range could be iden- 
tified instead by the lower attribute value, in which case 
the range would be open at the top rather than at the 
bottom. Since locking granularity is finer if the ranges 
are disjoint, it is preferable for one end of the interval to 
be open. A single lock-request resource identifier in this 
system identifies both a key value and the range that the 
key value bounds. This enables a transaction to lock 
with a single lock request both a range and the key 
value that bounds it. 

FIG. 6 lists the lock modes required in the 
ARIES/KVL system for the various operations that a 
transaction might include. Unlike FIG. 3, FIG. 6 in- 



cludes only one column for the first four operations 
listed, because an ARIES/KVL operation does not 
separately lock key values and key ranges. Of particular 
interest in the present context, however, is that FIG. 6 

5 does include a second column for insert and delete oper- 
ations. This does not indicate that key values and ranges 
are locked separately for these operations. Instead, it 
represents a separate lock on what will be described 
below as the "next" key value/range. This is necessi- 

10 tated by the fact that inserting and deleting records in a 
KVL system divides or extends existing key ranges. 

Suppose that a record having a key value of k,is to be 
deleted. Clearly, one transaction's deletion of a record 
that has been written, read, or inserted by a second 

15 uncommitted transaction potentially compromises the 
serializability of those transactions. Therefore, an oper- 
ation that deletes k,- should not be permitted if another 
transaction has any lock on k,- regardless of the mode of 
that lock. A deleting transaction accordingly tests for 

20 any such lock by requesting an X-mode lock on k,\ FIG. 
6 includes the "instant" notation to indicate that the 
transaction need only request a compatibility indication 
and not actually acquire the lock, for a reason that will 
presently become apparent. 

25 As FIG. 6 shows, the deleting transaction requests a 
lock not only on the targeted key value/range but also 
on the "next" key value/range, i.e., on key value/range 
ki+i, which includes the range previously represented 
by the deleted key value k/. Since that range has now 

30 been modified, no access to it should be permitted, so 
the transaction acquires an X-mode lock on range k/+ j. 

Insertions, too, require next-range locking. Let us 
suppose that an inserting transaction is to insert a new 
key value k/ between k, and k/+i. This will reduce the 

35 width of the range previously represented by key value 
k,-+i, so range k J+ i must be checked for locks. For this 
purpose, ARIES/KVL uses an instant lock in the IX 
mode if the inserting transaction has not previously 
locked that key /range. Clearly, if that range had been, 

40 say, scanned by another uncommitted transaction, as 
indicated by an S, X, or SIX lock, that range should not 
be modified by inserting a new record into it. Testing by 
means of an IX-mode lock prevents this. However, 
there is no reason why the k/ record cannot be inserted 

45 by one transaction just because another uncommitted 
transaction has previously inserted the k f+ i record, as 
indicated by a previously existing IX-mode lock. Since 
the requested DC-mode lock is compatible with an IX- 
mode lock, such an insert "in front of another insert 

50 can occur. (In front of and behind are defined by the 
direction in which key-valued ranges extend: a range 
extends in front of the key value that represents it.) The 
inserting transaction requests only an instant lock in this 
mode because there is no reason why one transaction's 

55 insertion of k,' should prevent another transaction's 
access to k/+i. 

This locking of the next range, although simple in 
principle, can be time-consuming in practice because of 
the need to identify the next key value, by which the 

60 next range is identified. To understand this requires 
consideration of the manner in which databases are 
indexed. The most popular type of index is the B-tree, 
which comprises a hierarchy of nodes, or sets of contig- 
uously stored index (as opposed to data) records. The 

65 highest-level, or root node in a typical B-tree contains 
N— 1 key values, N>1, sometimes called separators in 
this context, that divide the entire key space into N 
subspaces, with each of which is associated a lower- 
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level node to which the root node contains a pointer. dependence of file-utilization efficiency on the key- 
That is, the separators in one level's nodes represent the value distribution in the database. One particular 
key-value limits of nodes in the next level down. To find method, however, described in Lomet, "A Simple- 
a database record containing a given key value, a search Bounded Disorder File Organization with Good Per- 
routine starts at the root node and determines, by com- 5 formance," ACM Transactions on Database Systems, vol. 
paring the searched-for key value with the root node's 13, no. 4, December 1988, pp. 525-51, yields a single- 
separators, the subspace into which the searched-for record-access performance similar to that of straight 
key value falls. The search routine then follows the hashing methods while affording a range-search capa- 
pointer to the associated lower-level node, which typi- bility mat, at least for reasonably wide ranges, is similar 
cally itself contains a set of separators that further di- 10 to that of a conventional B-tree. 
vide the subspace, in which case it also contains pointers Like some other approaches, the bounded-disorder 
to still-lower-level nodes. This process continues until method employs an organization that is the same as that 
the search reaches a node in the lowest level of the ofa conventional B-tree at the higher levels, but the leaf 
hierarchy. Such a node is referred to as a leaf node. If nodes are much larger. The leaf nodes are large enough 
the search key is a secondary key, each leaf-node record 15 ^ ^ of ^ ^ j evek excep t the leaf level can conve- 
typically consists of a value of that key and the record mer itly be cached in main memory. Each leaf node is 
ID of a database record that contains that key value. In too large to make it practical routinely to read the entire 
the case of a primary index, the leaf-node records are node for single-record accesses, but it is essentially a 
typically the database records themselves. hasn ^ divided ^ bucke ts, to which access is di- 
The database and its indexes are ordinarily stored on 20 rected from the previous ^ i eve i by a hashing func- 
a disk or other non-volatile memory, whose contents ^ 

the host computer's operating system accesses in inte- Xhe search c f a bounded-disorder index accordingly 

gral numbers of pap. ^ie size of memory block, or in a manner essentially identical to that of a 

bucket, by which the DBMS accesses the database is ^ch until it reaches the penulti- 

typicaUy a smgle page or a number of pages small 25 mate node At ^ ^ ^ ^ identifie$ a 

enough not to tax the bandw,dth of the disk-to-buffer ^ nodCf but the leaf node is too large to be read rou- 

channe unduly, and a leaf node comprises a single memory for single-record accesses, 

bucket ma conventional B-tree. If the entire index other T i. i t, u a Af 

*u *i. i r j uu i i : «,™«™ « Instead, the search routine employs a hash function of 

than the leaf nodes could be cached in main memory, a . * e . . . I ^ , . , * 

, rj*u ~ a „ uu , n the searched-for key value to locate a particular bucket 

random access of a database record could be performed 30 . / *: . 

with a single disk access. For most large-sized data- Wlt ^ leaf node ' ^ 11 * o*lythis bucket that is 

bases, however, this is not practical, and the lowest two rea * mt0 ^ memor y t0 obtam searched-for re- 

tree levels must usually remain on the disk. Accord- co £r' , . , tt . t , _ 

ingly, one I/O operation must be performed to reach ^ result f organization is that all of the leaf 

the penultimate-level node, and a second I/O operation 35 no u des « ordered in key value with respect to each 

must be performed to reach the leaf node. (Of course, if other ' but leaf node °, f a plurality of buck- 

the search key is a secondary key, a still further I/O ets ' withm which records are attributed m accordance 

operation must be performed to reach the database wth functlon 80 ther * » no order 

record itself ) among the buckets within a leaf node. Preferably, no w- 

The B-tree organization is the most popular indexing 40 ever » ^ records a bucket m stored ^ 
approach, but another approach, employing a "hashed** ke V order - # 
file, yields better performance in random accesses of Clearly, since the entire index except for its leaf-node 
single records. In accordance with this approach, some level De cached m memory, a single-record file 
function of the search key value (the hash function) is access 0311 ordinarily be performed in a smgle I/O oper- 
used to compute the address of the bucket in which the 45 ation - (Again, a further I/O operation may be required 
record resides that contains the searched-for key value. for a bucket d»t has overflowed.) Since the leaf nodes 
Random access to a single record thus ordinarily re- ordered, however, ranges of key values are local- 
quires only a single I/O operation. (In practice, buckets ^ed to a subset of the leaf nodes, with the result that 
sometimes overflow, and a second I/O operation then range-searching performance approaches that of a con- 
must be performed to obtain records that reside in over- 50 ventional B-tree, at least for ranges that span several 
flow buckets, but this typically happens in only a small nodes. 

percentage of accesses.) A bounded-disorder file organization thus affords 
Although hashing yields superior performance for significant advantages, at least for a single-user database 
random accesses of single records, hash-type indexes system. But one may wish to apply the bounded-disor- 
are somewhat less popular than B-tree indexes because 55 der principle to multi-user database systems, so the 
their performance for range searching is abysmal. Since question arises of how to lock ranges where such an 
the B-tree separators assign unique nodal key-value organization is employed. A little reflection reveals that 
ranges to the leaf nodes, leaf nodes in B-trees are or- a bounded-disorder index lends itself to the type of 
dered by search-key value, but the same is not true of range locking in which the lockable ranges are in a 
the hash-file buckets. In a hash file, therefore, a range 60 sense independent of the key-value population: the lock- 
search will ordinarily require accessing all buckets, able ranges can be defined by the key-value limits on the 
while only those whose key-value ranges overlap the leaf nodes. Accordingly, when a single record is ac- 
target range need to be searched in a B-tree. cessed, the range on which an intention lock needs to be 
Not surprisingly, a number of attempts have been placed is readily identified by the index terms encoun- 
made to achieve the fast single-record accesses of hash- 65 tered in traversing the tree, since the index terms en- 
ing methods while also obtaining the range-search per- countered in the penultimate-level node are the key- 
form ance that simple B-tree organizations provide. value limits of the leaf node and thus of the lockable 
Some have achieved this result at the cost of a heavy range. 
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But the granularity of that approach is too coarse to the number of such accesses tends to be very small, 

be acceptable for many applications, so one would pre- while the reduction in I/O time for a single-record 

fer key-value locking. Unfortunately, conventional key- access is considerable. 

value locking — in which the lockable ranges are In accordance with another aspect of this invention, 
bounded by successive key values existing in the 5 moreover, insertions and deletions can be performed in 
database — has not heretofore been readily implementa- a single I/O operation even when the key value inserted 
ble with a bounded-disorder index. The reason for this or deleted is the last one in a bucket and thus would 
is that successive key values are not in general in the conventionally require a second I/O operation to iden- 
same bucket, so all of the buckets in a node would need tify the next key value. In accordance with this aspect 
to be accessed in order to identify the "next" key-valued 10 of the invention, which is applicable not only to bound- 
range implicated in a delete or insert operation. ed-disorder files but also to conventional B-trees, the 
In a scan operation, it would be acceptable to access key value used to represent the last lockable range in the 
all of the buckets in a node, since nearly as many pages bucket is the bucket's key-value limit, which can always 
would have to be accessed for a similar operation in a be determined without performing another I/O opera- 
conventional B-tree file if the scan operation's target 15 tion. 

range is very great, as it usually is Indeed, the bounded- DESCRIPTION OF THE DRAWINGS 

disorder mdex may provide performance advantages: 

because all buckets in a node ordinarily are stored con- These and further features and advantages of the 

tiguously, a scan in such an index would likely encoun- present invention are described below by reference to 

ter fewer separate sets of contiguous pages, so the I/O 20 the accompanying drawings, in which: 
operations could be completed more quickly. In an FIG. 1 a block diagram that depicts parts of a DBMS 

insert or delete operation, however, conventional key- that are used by a database definer during database 

value ranging results in significant disadvantages for a design; 

bounded-disorder organization. A conventional B-tree FIG. 2 is a block diagram of parts of a DBMS em- 

usually permits the implicated range to be identified 25 ployed to respond to requests from an ultimate user; 
without a further I/O operation, because the next key FIG. 3 is a lock-mode table that characterizes a type 

value is usually in the same leaf node as that in which of DBMS query compiler, 

the delete or insert operation is to be performed. But a FIG. 4 depicts an exemplary lock table maintained by 

bounded-disorder index requires that all of the buckets a lock manager; 

of a large node be inspected before the implicated range 30 FIG. 5 is a compatibility table that characterizes cer- 

can be identified. (It should be observed, however, that tain types of lock managers; 

even B-trees require a second I/O operation to identify FIG. 6 is a lock-mode table that characterizes certain 

the implicated key-valued range if the range bridges a other types of query compilers; 

page boundary.) FIG. 7 is a diagram depicting the structure of a 

35 bounded-disorder index; 



SUMMARY OF THE INVENTION 



FIG. 8 is a diagram of part of the population of the 



According to one aspect of the present invention, several buckets in a leaf node of a bounded-disorder 

only one bucket of a node in a bounded-disorder file index; and 

ordinarily needs to be searched in order to lock a key- FIG. 9 is a diagram depicting the segregation of key 

valued range when an insert or delete operation is per- 40 values between a primary bucket and the overflow 

formed. In accordance with this approach, the lockable bucket in accordance with one embodiment of the pres- 

ranges in a leaf node do not form a single sequence of ent invention. 

non-overlapping ranges but instead form a number of nFTATT FD DF SPRTPTION OF AN 

In other words, a range defined by the key values exist- FIG. 7 depicts a single-record access performed in a 
ing in one bucket of a node will ordinarily overlap bounded-disorder file. The main memory 42 includes 
ranges similarly defined by key values existing in other the upper levels of a primary-key index, including a 
buckets of the same node but will not overlap ranges penultimate-level node 44, which includes, among other 
defined by key values in different nodes. 50 index records ("index terms"), terms 46, 48, and 50. For 
Since the sizes of the ranges will on the average be n purposes of explanation, there is no loss in generality in 
times the range sizes employed in conventional key- considering the penultimate level to be the top level and 
valued locking, where n is the number of buckets per node 44 to be the root node. Accordingly, let us say that 
node, one might expect such an approach to have a a search begins by comparing the searched-for key 
significant adverse effect on concurrency. But it turns 55 value k with separators K in node 44. 
out that this is not the case. Although the average differ- We assume that the searched-for value k falls be- 
ence between two successive existing values k, and k»+ 1, tween K,-_ i and K/. This means that the database record 
in a single bucket is in general n times that between the whose key field equals k is contained in a node 51 to 
values that define a lockable range in conventional key- which the pointer component P, of index term 48 points, 
valued locking, a lock of such a range interferes on the 60 In accordance with the typical bounded-disorder file 
average with single-record operations on only 1/n of organization, the leaf node that the pointer identifies is 
the key values that fall between the bounding values. so large that reading it into main memory would take 
The reason why such an approach works for single- too long to be feasible for routine single-record ac- 
record accesses is that scan operations are required to cesses. Consequently, only a single one of the node's 
lock ranges in all buckets. Although this requirement 65 constituent buckets, namely, the one containing the 
can result in causing a scan operation to access some searched-for record, is read into main memory. For the 
pages to which access would not be required by con- sake of simplicity, we will assume that each node con- 
ventional key-valued locking, the fractional increase in tains only four "primary" buckets, designated by four 
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possible hash values, and that it further includes a fifth, 
overflow bucket. If the (atypically simple and ineffec- 
tive) hash function is simply the two least-significant 
bits of the searched-for key, for example, buckets 52, 54, 
56, and 58 might correspond to values 00, 01 , 10, and 11, 5 
respectively. In the illustrated embodiment, the index 
term 48 includes a size component S, which indicates 
the associated node's bucket size in pages. The result of 
hashing the searched-for key is multiplied by this value 
and the page size, and the result is added to the pointer 10 
value to find the beginning of the bucket to be read. 

FIG. 7 indicates that the size S/has a value of two, so 
two pages beginning at that bucket address are read into 
main memory, and the record having the searched-for 
key is found. (According to this particular file organiza- 15 
tion, the bucket may have overflowed, in which case it 
contains an indication of that fact, and a second disk 
access, of a fifth, overflow bucket 60, may be necessary. 
Since most accesses do not require a second disk access, 
the average number of I/O operations per single-record 20 
access is just slightly over one.) 

In the illustrated organization, the nodes have a fixed 
number of buckets, but the buckets are expandable; 
dashed lines indicate that each bucket of node 51 con- 
sists of two pages, while each one in a further node 62 25 
consists of three pages. Those familiar with bounded- 
disorder file organizations will recognize that this ex- 
pandable-bucket feature is not necessary: a fixed bucket 
size with a variable number of buckets per node is possi- 
ble, as is a fixed bucket size with a fixed number of 30 
buckets per node. However, we believe that the organi- 
zation depicted in FIG. 7 has certain operational advan- 
tages, which are not relevant here but which are de- 
scribed in the Lomet paper referenced above. 

We now turn to the manner in which key-value lock- 35 
ing can be practiced in such an organization. For this 
purpose, we consider FIG. 8, which depicts a portion of 
the key-value space encompassed by node 51. FIG. 8 
comprises five rows of circles and Xs. The circles and 
Xs in the lowest row represent all possible key values in 40 
the portion of the node's key-value space that FIG. 8 
represents. The Xs represent actually existing values, 
i.e., values contained in records currently in the data- 
base, while the circles represent other possible values. 

The upper four rows represent the distribution of 45 
those possible and existing values among the four buck- 
ets, each row being referenced by the reference numeral 
associated with the bucket that it represents. The 
dashed lines in FIG. 8 represent the hashing function in 
accordance with which the possible key values are 50 
distributed among the buckets. Suppose, now, that the 
record containing existing key value k3 is to be deleted. 
In accordance with conventional key-valued-locking 
practice, this would require that the deleting transaction 
acquire a lock on a range bounded by existing key val- 55 
ues kzand k4, which would be successive values existing 
in the node after key value k3 is deleted. To identify this 
range, the transaction would be required to access not 
only bucket 54 but also buckets 52, 56, and 58 so as to 
determine which is the next existing value and thereby 60 
identify the value k*, which is resident in bucket .58, that 
forms the upper bound of that range. After that range 
had been identified, the transaction would then request 
a lock on it. 

In accordance with the present invention, however, 65 
accesses of the other buckets are not required. Instead, 
the insert operation merely searches bucket 54 for . the 
next-higher key value in that bucket and requests a lock 
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on that value, namely, key value k$, which represents a 
range bounded by values ki and ks- 

Now let us suppose that a different transaction is to 
insert into the database a record having key value k*'. 
To preserve serializability, a transaction cannot insert a 
record into a range that results from a deletion by an- 
other uncommitted transaction, so an inserting transac- 
tion must request a lock on the range represented by the 
next-higher existing key value and thereby determine 
whether the target range has been locked by a deletion. 
In conventional key-valued locking, this range would 
be the range represented by key value ks, on which the 
other, deleting transaction has obtained a lock. Conven- 
tionally, therefore, insertion of the record having key 
value kV would not be permitted until the transaction 
that deleted key value k3 has committed. In accordance 
with the present invention, however, the inserting 
transaction does not identify the implicated range by 
searching for the next-higher value in the node: it 
searches for the next-higher value in the same bucket. 
Accordingly, unless a lock has been acquired by a dif- 
ferent transaction on key value k& which is in the same 
bucket 52 as that into which key value k*' is to be in- 
serted, the inserting transaction can proceed. 

The cost of employing parallel sets of overlapping 
ranges is that scan operations must lock all sets of ranges 
separately. For instance, suppose that a scan operation 
is to read all records whose key values are between k2 
and k4. In conventional key-valued locking systems, this 
would require locking two ranges. The first range, rep- 
resented by key value k3, would extend from that value 
down to key value k2- The other, represented by key 
value k4, would extend from that value to key value k3. 
In a conventional, single-bucket-mode B-tree organiza- 
tion, these key values would be in the same bucket in 
most cases, so no extra I/O operations would be re- 
quired to identify the ranges to be locked. Moreover, 
the total range thereby locked is, as was just described, 
only that between key values k2 and k4. In contrast, the 
present invention requires that all buckets in the node be 
accessed and that locks be acquired on the ranges repre- 
sented by key values k3, k4, ke, and k7. The values 
thereby locked would include some as high as k7 and 
some below ki. 

The foregoing example was chosen to emphasize the 
difference between conventional scanning and scanning 
in accordance with the present invention, and it appears 
to suggest that the method of the present invention 
requires considerably more access time for scan opera- 
tions than the conventional B-tree organization does. 
But this is true only for very small ranges. Scans are 
normally performed on ranges that span many buckets, 
and the total cost in I/O time for such scans is low on a 
percentage basis. 

We now turn to another aspect of the present inven- 
tion. Although the benefits of this aspect of the inven- 
tion are particularly pronounced for bounded-disorder 
indexes, it is also applicable to other systems that em- 
ploy key-valued locking. 

In the preceding comparisons with conventional key- 
valued locking, it was noted that finding the next key 
value, which is necessary in order to perform proper 
locking for insert and delete operations, can require an 
additional I/O operation even for conventional B-tree 
indexes in some circumstances. In particular, a second 
I/O operation is required by a delete operation to be 
performed on, and by an insert operation to be per- 
formed beyond, the last existing key value in the node 
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(in the case of a conventional B-tree) or bucket (in the 
case of a bounded-disorder index). In those operations, 
the next key value turns out to be in the next node or 
bucket, so a further I/O operation must be performed to 
determine the key value that represents one of the 5 
ranges that must be locked if serializability is to be pre- 
served. 

According to this other aspect of the invention, how- 
ever, there is no need for the second I/O operation. 
Instead of using only values in existing database records 10 
as lockable-range boundaries, we use each bucket's 
key-value limit as the upper bound of the last Iockable 
range in that bucket. In this way, the upper lockable- 
range boundary of the "next" range that insert and 
delete operations are required to lock can be determined 1 5 
in the initial index traversal or, if each leaf bucket stores 
an indication of its key-value limit, found in the bucket 
itself. 

Suppose, for instance, that FIG. 8 depicts the upper 
end of node 51 and that the key-value limit of that node 20 
is a value ks (not shown) that does not currently exist in 
the database and would fall in the next node if it did. 
Further suppose that key value k4is to be deleted. In the 
absence of this aspect of the invention, the necessary 
locking of the next range would require a separate I/O 25 
operation to identify the first existing value in the next 
node's bucket corresponding to the bucket 58 in which 
value k4 resides. In accordance with the present inven- 
tion, however, a Iockable range is bounded by key value 
ks, even though ks is not an existing key value. Since 30 
this nodal key-value limit can be obtained during the 
index traversal, for instance, next-range locking for the 
deletion of k4 does not require an extra I/O operation. 

The manner in which this approach is implemented is 
not critical. One convenient way is to have the search 35 
routine retain the index term that indicates the key- 
value limit of the node in which the deletion or insertion 
is to occur. Then, if no existing key value is found in the 
bucket beyond the position of the insertion or deletion, 
the operation requests a lock on the range represented 40 
by the bucket key-value limit determined from the 
nodal limit as explained above. Alternatively, the need 
to extract the node's key-value limit from the traversal 
of higher-level nodes can be avoided by having each 
bucket store its key-value limit. 45 

As those skilled in the art will appreciate, the nodes of 
an index tree are sometimes split to generate new nodes, 
and are sometimes combined to reduce the number of 
nodes, as the database contents change. When this hap- 
pens, the affected nodes' key-value limits change. Since 50 
the invention uses key-value limits to define Iockable 
ranges, tree restructuring involves modifying Iockable 
ranges. Use of this aspect of the invention therefore 
imposes the requirement that locks be acquired before 
restructuring occurs. 55 

Suppose, for instance, that a node containing key 
values k/ and k/+i is to be split at a key value k sp ji t be- 
tween k/and k/+i. That is, k^is to become the upper 
key-value limit k// m // of the lower resulting node. Fur- 
ther suppose that the k/+i range has been locked be- 60 
cause an uncommitted transaction has extended that 
range by deleting a key value. Serializability consider- 
ations dictate that range k }+ \ cannot be divided by 
insertion before the deleting transaction has committed 
and thus released its lock on the k/ + 1 range. But splitting 65 
the node at will in fact divide the locked k/+i 
range, so the node-splitting operation must request a 
lock on the k/+i range before splitting the node, just as 
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though a record containing k// m / ; were being inserted, if 
restructuring is to avoid violating the serializability 
requirement by dividing a range that has resulted from 
an uncommitted transaction's deletion operation. Lock- 
ing is similarly required when nodes need to be merged: 
deletion of k// m /r of the lower-order node extends the 
range of the first existing key value ki of the higher- 
order node, so the higher-order node's ki needs to be 
locked for deletion. 

This aspect of the invention therefore introduces a 
complication, but the resulting concurrency penalty is 
no worse than that which results from performing an 
actual insert operation, and it occurs only upon (rela- 
tively infrequent) node splits. Clearly, this is a small 
price to pay for the reduction in access time that is 
obtained between node splits. Moreover, the concur- 
rency penalties exacted by this locking for node split- 
ting are particularly low if one employs the lock-mode 
set described in our commonly assigned U.S. patent 
application for Key-Range Locking Providing Im- 
proved Concurrency, filed on even date herewith, 
which we hereby incorporate by reference. With that 
lock-mode set, locks intended to lock a key for singleton 
operations never interfere with node merging or split- 
ting; only scan operations involving the range in which 
the restructuring is to occur impede that restructuring. 

We now turn to certain embellishments that are bene- 
ficial in embodiments of the present invention. These 
concern the overflow buckets, such as bucket 60 of 
FIG. 6, that one can use in implementing a bounded-dis- 
order index. If the primary bucket into which the hash 
function directs an insertion is full, the insertion takes 
place in the overflow bucket. One could treat the over- 
flow bucket substantially as one treats the primary 
buckets. That is, range locks acquired by operations 
that end up in the overflow bucket would lock ranges 
defined by key values in that bucket This approach has 
the virtue of algorithmic simplicity. But it presents cer- 
tain problems when ranges are to be locked and when 
records in the overflow bucket need to be re-absorbed 
into a primary bucket because of, say, node splitting or 
the type of elastic bucket expansion described in the 
Lomet paper. 

The problem that overflow buckets present to range 
locking is that an overflow bucket will often contain 
few entries, perhaps none. The lock granularity thus 
becomes coarse, with the result that, say, a deletion of a 
single entry from an overflow bucket could require that 
the node's entire key range be locked. 

The problem with record re-absorption is that it in- 
volves a deletion and an insertion operation for each 
record re-absorbed, and each such operation requires 
range locking just as though it were truly an operation 
for a new record. The resultant large number of locks 
may make it difficult or impossible to absorb the re- 
cords. 

At the cost of only a little additional algorithmic 
complexity, these difficulties can be avoided by treating 
overflow buckets differently from primary buckets. 
Qne way of doing this is to use the overflow buckets in 
such a way that they can be thought of simply as spaces 
in which to grow extensions of the primary buckets. 
FIG. 9 depicts the resulting node organization. In FIG. 
9, the ith primary bucket 90 is depicted as containing a 
sequence of key-value entries ki through kj> while the 
overflow bucket segregates its records in accordance 
with the primary buckets into which they would have 
landed if there had been room. (Actual physical separa- 
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tion is not necessary; it would suffice, for instance, to belong to lockable key-value ranges, the ranges them- 

include in each overflow-bucket record a tag to indicate selves are defined only by the key values in the primary 

the primary bucket into which it would have been bucket: the overflow-bucket key values do not define 

placed if there had been room.) FIG. 9 depicts the re- those ranges. So when a key value is to be inserted into 

gionofa overflow bucket 92 that contains the overflow 5 or deleted from the overflow bucket, the only range 

from primary bucket 90, among others. FIG. 9 shows that is locked is the one represented by the next-higher 

that the ordered sequence of records in that region start key value in the corresponding primary bucket. Identi- 

at key value k y -+ 1, where k^ <k r when q<r. That is, key fying that next key value in the primary bucket involves 

values of all records in the overflow bucket that would no I/O cost, because the primary bucket will have to 

have been in primary bucket i if there had been room 10 have been read already in order to determine that it was 

are greater than all of the key values in the ith primary full or did not contain the key value to be deleted, 

bucket itself. Scanning in this approach differs only in that access- 

For range-locking purposes, one considers the over- ing an overflow-bucket key value requires no separate 

flow-bucket contents associated with a given primary locking. As the Lomet paper indicates, access to a key 

bucket as part of that primary bucket in the sense that 15 value in such scanning of a bounded-disorder file ordi- 

the first key value in the overflow bucket for a given narily requires that a lock be obtained on the range 

primary bucket becomes the next key for range locking represented by the next-higher key value in the bucket, 

when, for instance, a scan operation reaches the last key Since the key values are being accessed in key-value 

value actually stored in the primary bucket. This ap- order, the lockable range to which an overflow-bucket 

proach eliminates the key-range coarseness that would 20 key value belongs will always have been locked al- 

otherwise result when the overflow bucket contains a ready. 

small number of records. Additionally, absorbing over- In short, locking on single-value operations and scans 
flow records into a primary bucket requires no locking is relatively simple in this approach. Yet the coarse 
under this approach, since no ranges change in the pro- granularity that might otherwise result when the over- 
cess. 25 flow bucket contains few entries is avoided because 
This approach does add some algorithmic complex- overflow-bucket key values are not used to define lock- 
ity, and it exacts a performance penalty in certain opera- able ranges. 

tions. Specifically, consider an insertion operation that Compared with the previous approach, however, this 
is to occur when the primary bucket into which the approach is more complex when restructuring necessi- 
hashing function initially directs it is full. Such an oper- 30 tates re-absorption of the overflow bucket into the pri- 
ation always takes two read operations and a write mary buckets. In the previous approach, in which order 
operation, since the primary bucket needs to be read to is maintained between a primary bucket and its over- 
determine that it is full and the overflow bucket needs flow-bucket extension, and in which all key values 
to be read and written because that bucket's contents bound lockable ranges, the lockable ranges do not 
will be modified as a result of the insertion. This is true 35 change when the primary buckets absorb the overflow- 
regardless of whether the overflow bucket is treated bucket contents. But this is not true of the alternate 
separately, for range-locking purposes, from a primary approach, and the resulting need to divide lockable 
bucket. ranges could mean that re-absorption would be blocked 

However, if the approach just described is employed, any time there is a lock on any one of the possibly large 
in which an overflow bucket is treated simply as con- 40 number of lockable ranges into which index records 

tabling extensions of the primary buckets, then the in- from the overflow bucket are to be inserted, 

sertion into the primary bucket may have to be per- To avoid this difficulty, the initial re-absorption can 

formed to keep the proper key-value order between the be performed by placing the erstwhile overflow-bucket 

primary bucket and its extension in the overflow values physically into their respective primary buckets 
bucket, so a further write operation on the primary 45 without necessarily using them to define lockable 

bucket will be required as well. Even though the inser- ranges right away. For instance, a primary bucket can 

tion occurs in the primary bucket, the need to write the have an overflow section set aside for key values that 

overflow bucket remains, since it must be updated with have been returned from an overflow bucket but are not 

the high key value that the insertion forced out of the yet to be used in defining lockable ranges. Since the 
primary bucket But inserts that overflow account for 50 values that are placed in this section are not initially 

less than 10% of all inserts. Since the change in I/O cost used to define lockable key-value ranges, locking does 

for such inserts is 33% under this approach, this ap- not have to be performed in order to place them into the 

proach exacts an additional I/O cost over all inserts of primary bucket, so the restructuring that necessitates 

only 3.3%, which is not significant in view of the con- the re-absorption can proceed regardless of the exis- 
currency increase that it yields. 55 tence of locks. The key values from the overflow sec- 

An alternative approach avoids this I/O cost, albeit at tion can then be promoted to "full-fledged," lockable- 
the expense of some concurrency loss. This approach, range-defining primary-bucket values whenever it sub- 
too, separately stores differently hashed overflow- sequently becomes convenient to do so. 
bucket key values or otherwise associates them with the Reflection reveals that this approach does not require 
primary buckets into which they would have been 60 the primary bucket to have a physically separate over- 
placed if there had been room. But no attempt is made flow section. Instead, the re-absorbed key values can be 
in this approach to maintain key-value order between a located where they would be if they were "full- 
primary bucket and the key values associated with it in fledged" key values, but their entries can include flags 
the overflow bucket: a new key value to be inserted that identify them as not being among the values to be 
when the primary bucket is full goes into the overflow 65 used in defining lockable ranges, 
bucket even if key values exist both above and below it The present invention thus makes it feasible to obtain 
in the primary bucket Instead, although the overflow- the benefits of bounded-disorder indexes in concurrent- 
bucket key values are considered in this approach to transaction databases, and it speeds the locking process 
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for end-of-bucket insertions and deletions. It therefore 
constitutes a significant advance in the art. 
We claim: 

1. In a resource-management system for storing data 
files of data records that include key fields containing 5 
respective key values, for maintaining, for respective 
data files, bounded-disorder key indexes comprising 
index records organized into trees comprising respec- 
tive hierarchies of nodes, including-leaf nodes divided 
into respective pluralities of buckets including primary 10 
buckets, of leaf-node records containing respective ones 

of the key values contained in the data records, each 
primary bucket containing index records that contain 
key values that a hash function associates with that 
bucket, for performing insert and delete operations that 15 
lock key-valued lockable ranges, associated with re- 
spective key values that bound the lockable ranges with 
which they are associated and into which or from 
which the insert and delete operations respectively 
insert or delete values, and for performing scan opera- 20 
tions that scan target ranges and lock the key-valued 
lockable ranges that overlap the target ranges, the im- 
provement wherein a scan operation locks the key- 
valued lockable range associated with an existing key 
value in a given bucket if the range extending between 25 
that key value and the existing key value in front of it in 
the same bucket overlaps the scan operation's target 
range and an insert or delete operation that inserts or 
deletes in a bucket a key value in front of an existing key 
value in that bucket locks the key-valued lockable range 30 
associated with that existing key value without locking 
a key-valued lockable range associated with any inter- 
vening key value in any other bucket. 

2. A resource-management system as defined in claim 

1 wherein: 35 

A) at least one leaf node further includes an overflow 
bucket that contains at least one index record 
whose key value the hash function associates with 
a primary bucket of the same leaf node and with 
which a lockable range is associated; 40 

B) the key value of no index record in the overflow 
bucket falls between key values contained by the 
primary bucket with which the hash function asso- 
ciates that key value in the overflow bucket; 

C) a scan operation locks a key-valued lockable range 45 
associated with one existing key value in the over- 
flow bucket if the range extending between the one 
existing key value, which the hash function associ- 
ates with a given primary bucket, and the existing 
key value that the hash function associates with the 50 
same primary bucket and that is contained in front 
of the one key value in the overflow bucket over- 
laps the scan operation's target range; and 

D) an insert or delete operation that inserts or deletes 

in the overflow bucket, in front of an existing key 55 
value that is contained in the overflow bucket and 
is associated by the hash function with a given 
primary bucket, a key value that the hash function 
associates with the same primary bucket locks a 
key-valued lockable range associated with that 60 
existing key value without locking a key-valued 
lockable range associated with any intervening key 
value that the hash function associates with any 
other primary bucket. 

3. A resource-management system as defined in claim 65 
1 wherein: 

A) at least one leaf node further includes an overflow 
bucket, which contains at least one index record 
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whose key value the hash function associates with 
a primary bucket of the same leaf node and which 
bounds no lockable range; 
B) an insert or delete operation that inserts or deletes 
in the overflow bucket a key value that the hash 
function associates with a given primary bucket 
locks the key- valued range associated with a key 
value contained in that primary bucket. 

4. A resource-management system as defined in claim 
1 wherein: 

A) each bucket is associated with a respective bucket 
key-value range bounded by respective key-value 
limits; 

B) each leaf-node record contains a respective one of 
the key values that are contained in the data re- 
cords and fall within the bucket key-value ranges 
associated with the bucket that contains that leaf- 
node record; and 

C) an insert or delete operation directed to the last 
existing key value in a bucket locks a key-valued 
lockable range bounded by the key-value limit that 
bounds the bucket key-value range of that bucket. 

5. A resource-management system as defined in claim 
4 wherein the resource-management system further 
includes means for splitting a predecessor bucket into 
successor buckets by: 

A) locking a key-valued lockable range, bounded by 
a split key value, in the predecessor bucket; and 

B) subsequently placing into a pair of successor buck- 
ets all index records from the predecessor bucket 
whose key values are respectively less than and 
greater, than the split key value. 

6. In a resource-management system for storing data 
files of data records that include key fields containing 
respective key values, for accessing the data records in 
the nonvolatile memory, for maintaining, for respective 
data files, key indexes comprising index records orga- 
nized into trees comprising respective hierarchies of 
nodes, including leaf nodes, stored in the nonvolatile 
memory, that comprise leaf-node records that the re- 
source-management system accesses in buckets thereof 
respectively associated with bucket key-value ranges 
bounded by respective key-value limits, each leaf-node 
record containing a respective one of the key values 
that are contained in the data records and fall within the 
bucket key-value range associated with that leaf-node 
record's bucket, for performing insert and delete opera- 
tions that lock key-valued lockable ranges, associated 
with respective key values that bound the lockable 
ranges with which they are associated and into which 
or from which the insert and delete operations respec- 
tively insert or delete values, and for performing scan 
operations that scan target ranges and lock the key- 
valued lockable ranges that overlap the target ranges, 
the improvement wherein an insert or delete operation 
directed to the last existing key value in a bucket locks 
a key-valued lockable range associated with the key- 
value limit that bounds the bucket key-value range of 
that bucket. 

7. A resource-management system as defined in claim 

6 wherein the indexes are bounded-disorder key in- 
dexes, in which each leaf node is divided into a plurality 
of the buckets including primary buckets, each primary 
bucket containing index records that contain key values 
that a hash function associates with that bucket. 

8. A resource-management system as defined in claim 

7 wherein the resource-management system further 
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includes means for splitting a predecessor bucket into 
successor buckets by: 

A) locking a key-valued lockable range, bounded by 
a split key value, in the predecessor bucket; and 

B) subsequently placing into a pair of successor buck- 5 
ets all index records from the predecessor node . 
whose key values are respectively less than and 
greater than the split key value. 

9. A resource-management system as defined in claim 
6 wherein the resource-management system further 10 



includes means for splitting a predecessor bucket into 
successor buckets by: 

A) locking a key-valued lockable range, associated 
with a split key value, in the predecessor bucket; 
and 

B) placing into a pair of successor buckets all index 
records from the predecessor node whose key val- 
ues are respectively less than and greater than the 
split key value. 
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